Homework 07: Pairs Plots, Contour Plots, Heat Maps, Distances

Problem 1: Pick Final Project Teams (10 points)

As discussed in class and the syllabus, you will be working on final projects throughout the rest of the semester, to be submitted during finals week. See here for many examples of last semester’s final projects.

For this week, all you have to do is pick your final project team. All teams should consist of 3-4 students. You have two options for picking a team:

  1. Amongst yourselves, organize a team of 3-4 students that you would like to work with for the final project. Once your team has assembled, email me (as a team) telling me who is on your team.

  2. Email me telling me that you would like to be randomized to a team. If you find only one other student you’d like to work with, the two of you can email me (as a pair) asking to be randomized to a team together.

All you have to do to get the 10 points for this part is email me by class on Wednesday, October 27 with this information. Thus, you have a bit more time after the homework deadline to do this, but the sooner the better. This should be an easy 10 points!

Problem 2: Pairs Plots (21 points)

In this problem, we will use a dataset on students’ academic performance, found here:

data = read.csv("https://raw.githubusercontent.com/zjbranson/315Fall2021/master/students.csv")

Details about the dataset are found here. However, the main things you need to know about this dataset are:

  • Students’ Grade is classified as Low (L), Medium (M), or High (H).
  • Covariates: There are 15 variables on student characteristics and behaviors, 4 of which are quantitative.
  A. (10pts) First, create a subset of the data, called data.subset, which contains only the following variables:
  • RaisedHands
  • VisitedResources
  • AnnouncementsView
  • Discussion
  • Gender
  • Grade

After you’ve made data.subset, use the ggpairs function to make a pairs plot of the quantitative variables in data.subset (i.e., the first four variables in the above list). Your plot should be a 4x4 grid of panels, 6 of which are scatterplots. Don’t worry about changing the title/labels.

#creating the desired subset
data.subset = subset(data,
    select = c("RaisedHands", "VisitedResources", "AnnouncementsView", "Discussion", "Gender", "Grade"))
#creating a pairs plot of just quantitative variables
library(GGally)
ggpairs(data = data.subset, columns = 1:4)

After you’ve made your plot, answer the following questions:

  • Which pair of variables has the highest correlation?

VisitedResources and RaisedHands have the highest correlation (0.692).

  • Which pair of variables has the lowest correlation?

VisitedResources and Discussion have the lowest correlation (0.243).
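
The correlations printed in the upper panels of a pairs plot can also be ranked directly with cor(). The sketch below is only illustrative: toy is simulated stand-in data (not the students dataset), and pairs.df, highest, and lowest are hypothetical names. With the real data, one would presumably pass the four quantitative columns of data.subset to cor() instead.

```r
# Toy stand-in for the four quantitative columns (simulated, NOT the real data)
set.seed(315)
n <- 200
z <- rnorm(n)
toy <- data.frame(
  RaisedHands       = z + rnorm(n, sd = 0.5),
  VisitedResources  = z + rnorm(n, sd = 0.7),
  AnnouncementsView = z + rnorm(n, sd = 1.0),
  Discussion        = rnorm(n)
)

# Pairwise Pearson correlations (the same statistic ggpairs prints by default)
cor.mat <- cor(toy)

# Keep each pair once by taking only the upper triangle of the matrix
idx <- which(upper.tri(cor.mat), arr.ind = TRUE)
pairs.df <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]
)

# Sort from highest to lowest correlation and pull out the two extremes
pairs.df <- pairs.df[order(pairs.df$r, decreasing = TRUE), ]
highest  <- pairs.df[1, ]
lowest   <- pairs.df[nrow(pairs.df), ]
print(pairs.df)
```

Four variables give choose(4, 2) = 6 distinct pairs, which matches the 6 scatterplots in the 4x4 grid.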

  B. (11pts) Now, create a pairs plot with the following variables:
  • RaisedHands
  • VisitedResources
  • Grade

Also, using the mapping argument, color the pairs plot by the Gender variable. Make sure that there is some transparency in the plot.

ggpairs(data = data.subset, columns = c(1,2,6), mapping = aes(color = Gender, alpha = 0.7))

After you’ve made your plot, answer the following: In 1-3 sentences, describe the distribution of VisitedResources conditional on Grade and Gender.

From our plot in Part B (in particular, the side-by-side boxplots in the middle row), we can see that the distribution of VisitedResources is similar between genders at every level of Grade. However, students with High and Medium grades (H and M, respectively) tend to have high values of VisitedResources, while students with Low grades (L) tend to have low values of VisitedResources.

Hint: This question is NOT asking you to describe the distribution of (1) VisitedResources conditional on Grade, and (2) VisitedResources conditional on Gender. Rather, it’s asking you to describe the distribution of VisitedResources conditional on Grade AND Gender (together).
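
Conditioning on Grade AND Gender together means looking at one distribution per (Grade, Gender) cell, so with three grades and two genders there are six conditional distributions to compare. A minimal sketch of that idea in numbers, assuming a simulated stand-in data frame toy (not the students dataset) and a hypothetical result name cond.summary:

```r
# Toy stand-in (simulated, NOT the students data): 50 students per Grade-by-Gender cell
set.seed(315)
toy <- expand.grid(
  Grade  = factor(c("L", "M", "H"), levels = c("L", "M", "H")),
  Gender = c("F", "M"),
  rep    = 1:50
)
toy$VisitedResources <- round(runif(nrow(toy), min = 0, max = 100))

# One summary per (Grade, Gender) combination: the conditioning is joint,
# not Grade alone and then Gender alone
cond.summary <- aggregate(VisitedResources ~ Grade + Gender, data = toy, FUN = median)
print(cond.summary)
```

Each row of cond.summary corresponds to one of the boxplots that appear when the pairs plot is colored by Gender and split by Grade.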



Problem 3: Contour Plots and Heat Maps (33 points)

In this problem, we will continue working with the student dataset from Problem 2.

  A. (10pts) For this part, do the following:
  • Create a scatterplot of RaisedHands and VisitedResources with contour lines added using geom_density2d().
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) + geom_density2d() + geom_point() + 
  labs(x = "Raised Hands", y = "Visited Resources")

  • In class we discussed how contour lines use two bandwidths; geom_density2d() estimates these bandwidths by default. Now, copy-and-paste your above code, but make the bandwidth smaller by setting h = c(10, 10) within geom_density2d().
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) + geom_density2d(h = c(10,10)) + geom_point() + 
  labs(x = "Raised Hands", y = "Visited Resources")

  • Compare and contrast the two plots in 1-3 sentences.

The first plot consists of 2-3 modes, which are divided into many smaller modes in the second plot. For example, in the second plot, the top-right of the scatterplot is still captured by one large mode (similar to the first plot), but that mode is now divided into 2-4 smaller modes. There are also many “small islands” of modes throughout the second plot, denoting very small clusters of points that are similar in terms of these two variables (which we don’t see in the first plot). More generally, making the bandwidth smaller emphasizes many small modes within the data, similar to what we saw with kernel density smoothing for one quantitative variable.
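
To see how far h = c(10, 10) is from the default, it can help to compute the default bandwidths directly. According to the ggplot2 documentation, when h is not supplied the bandwidths are estimated with MASS::bandwidth.nrd(), and the density itself comes from MASS::kde2d(). The sketch below assumes simulated stand-in data (x and y are not the students data) and hypothetical object names:

```r
library(MASS)  # a recommended package shipped with standard R installations

# Toy stand-in (simulated, NOT the students data): two clusters on a 0-100 scale
set.seed(315)
x <- c(rnorm(100, mean = 25, sd = 10), rnorm(100, mean = 75, sd = 10))
y <- c(rnorm(100, mean = 25, sd = 10), rnorm(100, mean = 75, sd = 10))

# Rule-of-thumb bandwidths, i.e., what is used when h is not supplied
h.default <- c(bandwidth.nrd(x), bandwidth.nrd(y))
print(h.default)

# Estimate the 2D density on the same 100x100 grid with two bandwidth choices
dens.default <- kde2d(x, y, h = h.default, n = 100)
dens.small   <- kde2d(x, y, h = c(10, 10), n = 100)

# A smaller bandwidth gives a spikier surface with taller, narrower peaks
peak.default <- max(dens.default$z)
peak.small   <- max(dens.small$z)
```

Comparing h.default to c(10, 10) shows how much less smoothing the second plot applies, which is why it breaks the large modes into many small ones.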

  B. (13pts) Similar to Part A, again make a scatterplot of RaisedHands and VisitedResources with contour lines, but with the following changes:
  • Make the bandwidth of the contour lines larger by setting h = c(80, 80) within geom_density2d()
  • Set the color of the points according to Grade and the shape of the points according to Gender.
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) + geom_density2d(h = c(80, 80)) +
geom_point(aes(color = Grade, shape = Gender)) + 
  labs(x = "Raised Hands", y = "Visited Resources")

After you’ve made your plot, answer the following two questions:

  • How many modes are there in the scatterplot? In your answer, also characterize/describe each mode in terms of RaisedHands and VisitedResources.

From this plot, we can see that there are two modes (one in the bottom left of the plot, and one in the top right of the plot). These two modes can be considered students with “limited engagement” and “a lot of engagement,” respectively (i.e., the first mode denotes students with low levels of RaisedHands and VisitedResources, and the second mode denotes students with high levels of these variables).

  • In 1-3 sentences, characterize/describe each mode in terms of Grade and Gender.

It seems like the students with “limited engagement” (i.e., students in the bottom left mode) tend to be male and have Low (L) or Medium (M) grades. Meanwhile, students with “a lot of engagement” (i.e., students in the top right mode) tend to have Medium (M) or High (H) grades; there also appears to be about an equal distribution of genders in this mode.

[Sidenote: the above is the correct plot. It is less informative to put color and shape within ggplot() itself (see below), because geom_density2d() then inherits those aesthetics and draws a separate set of contour lines for each color and shape combination.]

ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources, color = Grade, shape = Gender)) + geom_density2d(h = c(80, 80)) +
geom_point() + 
  labs(x = "Raised Hands", y = "Visited Resources")

  C. (10pts) For this part, you’ll have to make two different heat maps (and all you’ll need to do is turn in the two heat maps). Please do the following:
  • Make a heat map of RaisedHands and VisitedResources with points added but no contour lines (using the default bandwidth) with stat_density2d. Furthermore, change the default colors using scale_fill_gradient() and setting the low and high arguments in that function. Be sure that you use geom_point() after you use stat_density2d (otherwise, you won’t be able to see the points).
ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) +
stat_density2d(aes(fill = after_stat(density)), geom = "tile", contour = FALSE) +
scale_fill_gradient(low = "white", high = "red")  +
geom_point() + 
  labs(x = "Raised Hands", y = "Visited Resources")

  • Now you’re going to alter your above heat map such that it uses three colors instead of two. Copy-and-paste your above code, and then follow these steps:
  1. Change scale_fill_gradient() to scale_fill_gradient2().
  2. Within scale_fill_gradient2(), specify a “medium density” color using the mid argument (similar to the low and high arguments).
  3. Within scale_fill_gradient2(), there is an argument called midpoint that specifies what a “medium density” is. The default is 0, which doesn’t make sense for densities, because 0 is the lowest possible value for densities. So, set midpoint equal to a non-zero number that you think makes sense, given the range of densities you saw in your previous heat map.

You should end up with a heat map that looks kind of cool, or at least cooler than the two-color heat map.

ggplot(data = data.subset, aes(x = RaisedHands, y = VisitedResources)) +
stat_density2d(aes(fill = after_stat(density)), geom = "tile", contour = FALSE) +
scale_fill_gradient2(low = "white", mid = "blue", high = "red", midpoint = 0.00015)  +
geom_point() + 
  labs(x = "Raised Hands", y = "Visited Resources")

Hint: For the midpoint argument, your graph should be a gradient of three different colors that you’ve specified. If this isn’t the case, you may have specified midpoint poorly.
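
One way to pick a sensible midpoint is to look at the range of the estimated densities and choose a value inside it. The sketch below does this with MASS::kde2d() on simulated stand-in data (x and y are not the students data; dens.range and midpoint.guess are hypothetical names). The grid that ggplot2 uses may differ slightly from this one, so the result should be treated as a starting value to adjust by eye.

```r
library(MASS)  # a recommended package shipped with standard R installations

# Toy stand-in (simulated, NOT the students data): two clusters on a 0-100 scale
set.seed(315)
x <- c(rnorm(100, mean = 25, sd = 10), rnorm(100, mean = 75, sd = 10))
y <- c(rnorm(100, mean = 25, sd = 10), rnorm(100, mean = 75, sd = 10))

# 2D kernel density estimate with the default (rule-of-thumb) bandwidths
dens <- kde2d(x, y, n = 100)

# The fill scale covers this range, so the midpoint should fall inside it
dens.range     <- range(dens$z)
midpoint.guess <- mean(dens.range)  # halfway between the lowest and highest density
print(dens.range)
print(midpoint.guess)
```

Densities on a 0-100 by 0-100 scale are tiny numbers, which is why a midpoint on the order of 1e-4 (as in the solution code) gives a visible three-color gradient while a midpoint of 0 does not.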

Problem 4: Olive Distances (36 points)

We’ll again work with the olive oils dataset used in Lab 7. The dataset can be found here and more information about the data can be found here.

Here is the code to define the olive dataset:

olive = read.csv("https://raw.githubusercontent.com/zjbranson/315Fall2021/main/olive.csv")
  A. (10pts) Plot the \(k = 2\) dimensions from multi-dimensional scaling (MDS) on a scatterplot. When doing this, use the Euclidean distance, and only use the quantitative variables (i.e., not area or region) to compute that distance. Remember to standardize your variables.
# first, grab just the quantitative variables
olive.subset = subset(olive, select = -c(area, region))

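# standardize the variables: divide each one by its standard deviation
# (no centering is needed, since Euclidean distances are unchanged by shifting a variable)
olive.subset = apply(olive.subset, MARGIN = 2, FUN = function(x) x/sd(x))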
olive.subset = as.data.frame(olive.subset)

#compute the distance matrix
dist.olive = dist(olive.subset)

#run MDS
mds.olive = cmdscale(dist.olive, k = 2)
#add MDS dimensions to the dataset
olive$mds1 = mds.olive[,1]; olive$mds2 = mds.olive[,2]

#scatterplot of MDS dimensions
ggplot(olive, aes(x = mds1, y = mds2)) + geom_point()
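
Two things about this pipeline can be checked numerically: that dividing by the standard deviation without centering yields the same distances as full standardization, and how much of the distance structure two MDS dimensions retain. The sketch below uses the built-in iris measurements as a stand-in for the olive variables (an assumption made only so the example is self-contained); the object names are hypothetical.

```r
# Stand-in data: the four numeric iris columns (NOT the olive data)
X <- iris[, 1:4]

# Scaling only (as in the solution) versus centering and scaling
X.scaled       <- apply(X, MARGIN = 2, FUN = function(x) x / sd(x))
X.standardized <- scale(X)

# Euclidean distances ignore shifts, so both versions give the same distance matrix
d.scaled       <- dist(X.scaled)
d.standardized <- dist(X.standardized)
same.distances <- isTRUE(all.equal(as.vector(d.scaled), as.vector(d.standardized)))
print(same.distances)

# eig = TRUE also returns goodness-of-fit measures for the k = 2 solution
mds.fit <- cmdscale(d.scaled, k = 2, eig = TRUE)
print(mds.fit$GOF)
```

A goodness-of-fit value close to 1 suggests that the two-dimensional scatterplot preserves most of the pairwise distances; a low value would be a reason to interpret the plot more cautiously.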

  B. (10pts) Now copy-and-paste the graph you made in Part A, and make the following changes to your graph:
  • Add something to the graph such that you can determine how many modes there are in your Part A scatterplot.

  • Color the points by area.

ggplot(olive, aes(x = mds1, y = mds2)) + geom_point(aes(color = area)) + geom_density2d()

After you’ve made your plot, answer the following questions:

  • How many modes would you say there are in the scatterplot? Explain your answer in 1-2 sentences.

It appears that there are three main modes (left center, middle center, and right center) as well as two minor modes (bottom center and top left).

  • Summarize the takeaways from this graphic in 1-3 sentences.

The three main modes in the data clearly correspond to the three different areas in the dataset: The left-center mode corresponds to Northern oils, the middle-center mode corresponds to Sardinian oils, and the right-center mode corresponds to Southern oils. Meanwhile, the top-left minor mode corresponds to Northern oils, while the bottom-right minor mode corresponds to Southern oils. Thus, Sardinian oils appear to be tightly clustered, while Northern and Southern oils are also quite clustered but to a less concentrated degree. This is intuitive: Sardinia is an island, while North and South are more general areas of Italy, and thus we would expect there to be more heterogeneity in the oils that come from those areas.

  C. (8pts) As of now, the MDS dimensions don’t have a lot of interpretability. Thus, it can be useful to see how the MDS dimensions relate to the original data. Let’s focus on the first dimension returned by the MDS that you ran in Part A; call this \(MDS_1\). For this part, do the following:
  • First, run a linear regression with \(MDS_1\) as the outcome and the quantitative variables in the dataset as the covariates (don’t include interactions). Be sure you don’t include area, region, or \(MDS_2\) (the second dimension from MDS) in your regression.
summary(lm(mds1~.-mds2-area-region, data = olive))
## 
## Call:
## lm(formula = mds1 ~ . - mds2 - area - region, data = olive)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.888e-13 -5.200e-16  5.600e-16  1.590e-15  3.104e-14 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  2.093e+00  3.013e-13  6.946e+12   <2e-16 ***
## palmitic     2.733e-03  3.287e-17  8.314e+13   <2e-16 ***
## palmitoleic  8.577e-03  4.036e-17  2.125e+14   <2e-16 ***
## stearic     -2.685e-03  3.777e-17 -7.108e+13   <2e-16 ***
## oleic       -1.218e-03  2.999e-17 -4.061e+13   <2e-16 ***
## linoleic     1.506e-03  2.945e-17  5.114e+13   <2e-16 ***
## linolenic    1.689e-02  9.267e-17  1.822e+14   <2e-16 ***
## arachidic    1.036e-02  5.178e-17  2.002e+14   <2e-16 ***
## eicosenoic   2.214e-02  8.485e-17  2.610e+14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.659e-14 on 563 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 9.648e+29 on 8 and 563 DF,  p-value: < 2.2e-16
  • Answer the following: Which variables are positively associated with \(MDS_1\) to a statistically-significant degree, and which are negatively associated with \(MDS_1\) to a statistically-significant degree?

From the above, all of the variables are significantly associated with \(MDS_1\) (every \(p\)-value is below 2e-16). In particular, the following variables are positively associated with \(MDS_1\):

  • palmitic
  • palmitoleic
  • linoleic
  • linolenic
  • arachidic
  • eicosenoic

And these variables are negatively associated with \(MDS_1\):

  • stearic
  • oleic
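
The R-squared of 1 and the residuals on the order of 1e-14 in the regression output are expected rather than suspicious. Classical MDS on Euclidean distances reproduces principal component scores, so \(MDS_1\) is an exact linear combination of the quantitative variables; the \(p\)-values therefore mostly reflect numerical precision, and the signs of the coefficients carry the information. The sketch below illustrates this with the built-in iris measurements as a stand-in for the olive data (an assumption for self-containment; object names are hypothetical).

```r
# Stand-in data: the four numeric iris columns (NOT the olive data)
X <- iris[, 1:4]
X.scaled <- as.data.frame(apply(X, MARGIN = 2, FUN = function(x) x / sd(x)))

# First MDS dimension from Euclidean distances on the scaled variables
mds1 <- cmdscale(dist(X.scaled), k = 2)[, 1]

# Regress MDS_1 on the original (unscaled) variables, as in the solution
fit <- lm(mds1 ~ ., data = cbind(mds1 = mds1, X))

# R-squared computed by hand: essentially 1, because the fit is exact
r2 <- 1 - sum(resid(fit)^2) / sum((mds1 - mean(mds1))^2)
print(r2)

# MDS_1 matches the first principal component score up to a sign flip
pc1 <- prcomp(X, scale. = TRUE)$x[, 1]
abs.cor <- abs(cor(mds1, pc1))
print(abs.cor)
```

Because the sign of an MDS dimension is arbitrary, "positively associated" and "negatively associated" could swap if the same analysis were rerun with a different implementation; what is stable is which variables move together and which move in opposite directions.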
  D. (8pts) Pick two variables that are positively associated with \(MDS_1\) to a statistically-significant degree and two variables that are negatively associated with \(MDS_1\) to a statistically-significant degree. Produce one graph that effectively does the following:
  • Plots all the pairwise scatterplots for the four variables (there should be six scatterplots), all colored by area.
  • Plots the marginal distribution of each of the four variables, each colored by area.
  • Sets alpha = 0.5 such that there is some transparency in the plot.
library(GGally)
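# select the four chosen variables by name rather than by column position
ggpairs(olive, columns = c("palmitic", "palmitoleic", "stearic", "oleic"),
        mapping = aes(color = area, alpha = 0.5))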

After making your graph, summarize the main takeaways from that graph in 1-4 sentences. In your interpretation, be sure to compare the different areas in terms of each of the four variables you plotted.

We choose palmitic and palmitoleic as the positively-associated variables and stearic and oleic as the negatively-associated variables. Below is a pairs plot colored by area.

By looking at the smoothed density plots along the diagonal, we can see that South has much larger palmitic and palmitoleic values than the other two areas (which do not seem to differ much in these two variables). There do not seem to be large differences in stearic among the three areas, although there is less variance for Sardinia. Meanwhile, South exhibits the lowest oleic values, followed by Sardinia, followed by North. Looking at the off-diagonal scatterplots, North and Sardinia appear to be well separated only on oleic; meanwhile, South seems well separated from the other two areas in every respect, but with much more variation in every variable.